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, Abstract 

Q . We propose a shrinkage procedure for simultaneous variable selection and estimation in 

q_) ■ generalized linear models (GLMs) with an explicit predictive motivation. The procedure es- 

, timates the coefficients by minimizing the Kullback-Leibler divergence of a set of predictive 

distributions to the corresponding predictive distributions for the full model, subject to an h 
constraint on the coefficient vector. This results in selection of a parsimonious model with sim- 
ilar predictive performance to the full model. Thanks to its similar form to the original lasso 
problem for GLMs, our procedure can benefit from available Zi-regularization path algorithms. 
P_] ' Simulation studies and real-data examples confirm the efficiency of our method in terms of 

predictive performance on future observations. 

Keywords. Generalized linear models, Kullback-Leibler divergence, Lasso, Optimal prediction, 
■ Variable selection. 

1 Introduction 

> 

| A primary goal in statistics is to develop algorithms that predict future data well from past obser- 

vations. In regression problems where a large number of predictors are involved predictive accuracy 
in statistical modeling may depend to a large extent on model selection strategies. For generalized 
. linear models (GLMs), for example, a large number of potential predictors are often given in order 

to reduce modeling bias, and one then would like to select a smaller subset achieving some kind 
of optimality properties. Popular methods such as the lasso and its variants can achieve model 
selection consistency, i.e., if the true model was included in the model set under consideration, these 
methods would be able to identify (asymptotically) the true model. However, whether or not the 
true model exists is a controversial issue. For a real dataset, it is believed either that no true model 
exists or that the true model has an infinite number of parameters (Burnham and Anderson, 2002). 
In this paper we deal with the problem of estimation and variable selection for GLMs with the goal 
of prediction in mind. 

From the Bayesian perspective it is sometimes argued that the full model should be used to 
achieve the best prediction accuracy (Aitchison, 1975; Geisser, 1993). However, with many pre- 
dictors prior specification and elicitation may be difficult, and the full model does not have inter- 
pretability - a property that is often desirable for many statistical procedures - because it does not 
tell us which and how predictors affect the response. Another drawback of using the full model is 
that if there is a cost associated with data collection then it would be inadvisable to use all of the 
predictors. This motivates the idea of choosing a submodel whose predictive distribution is close 
to that of the full model. This idea has been somewhat recognized in the literature. Brown et al. 
(2002) look at Bayesian model averaging incorporating variable selection for prediction. Tran (2009) 
and Vehtari and Lampinen (2004) propose model selection methods based on Kullback-Leibler di- 
vergence from the predictive distribution of the full model to the predictive distributions of the 
submodels. These works are motivated by the idea of trading off between prediction accuracy and 
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parsimony. However, these methods are challenging to implement because searching over the whole 
model space is computationally infeasible. Like the idea of the lasso (Tibshirani, 1996), we overcome 
this problem by using l\ constraints on the coefficients. By doing this, we can enjoy the computa- 
tional advantages of the algorithms for convex optimization with l\ constraints. Unlike the lasso, 
however, our approach has an explicit predictive motivation which aims at selecting a useful model 
with high prediction accuracy. A related approach is considered by Nott and Leng (2010) based 
on Kullback-Leibler projections, motivated by earlier work of Dupuis and Robert (2003) although 
these approaches are not based directly on posterior predictive distributions. 

For a collection of N predictive distributions obtained from the full model, we write KL^ ( Mf u n 1 1 Mp ) , 
i = l,...,N for the Kullback-Leibler divergences from the predictive distributions of the model based 
on coefficient vector /3 to those of the full model. Our approach in its general form is to solve for (3 
the following optimization problem 

N 

min^KL i (M full ||M /3 ) + A||/3|| il 

13 i=l 

with A a shrinkage parameter as in the original lasso. The main contribution of the present paper 
is to motivate and develop such a procedure for variable selection and estimation in GLMs that 
(i) automatically simultaneously estimates the coefficients and selects significant predictors; (ii) 
achieves good prediction accuracy; (iii) is broadly applicable; (iv) is computationally efficient. This 
procedure will be called the predictive lasso or plasso for short. 

The plasso for GLMs will be presented in Section [2] Section [3] presents useful prior specifications 
which can facilitate computation. In particular, we discuss in more detail the plasso for linear models 
and extend our previous discussion to a weighted version of the basic approach. Simulation and real- 
data examples are presented in Section|4]to demonstrate the use of the plasso and to compare it with 
the adaptive lasso (Zou, 2006) in terms of predictive performance. Section [3] contains concluding 
remarks. 



2 The predictive lasso 

We consider the problem of estimation and variable selection for GLMs with potential covariates 
x = (x = l,x\,...,x p )' £ X and the response y &y. With a suitable link function g, g(E(y\xj) is 
assumed to be a linear combination of x 

g(E(y\x)) =/3 + Pixi + ... + f3 p x p = x'/3. (1) 

We assume that the covariates Xi are in their final forms, no further transformations are needed 
(i.e., for various reasons and in order to keep things simple, we restrict ourselves to the linear 
approximation ([I])). The sampling distribution of an observation A, = (x, y) then is assumed to 
have the following form 

p(A|/3, 0) = p(x)p(y\x, /3, <)>) oc p(x) exp f -L [ yi 9(x'/3) - b(9(x'(3))] ) , 

where (3^M P+1 , 0>O are the coefficient vector and scale parameter, respectively, and 9, a and 
b are known functions. If covariates x are not random, the density p(x) in the above expression 
can be omitted. We are concerned with the problem of simultaneous coefficient estimation and 
variable selection with the goal of prediction in mind. Like the lasso, we would like to develop a 
method for simultaneous variable selection and parameter estimation. However, unlike the lasso our 
approach has a more explicit predictive motivation, which aims at producing a useful model with 
high prediction accuracy. 

Given the past dataset D and certain priors for parameters (j3, </>) of the full model, the predictive 
distribution p(A\D) for a future observation A = (x,y) is given by 

p(A\D) = p(x\D)p(y\x, D) = p{x\D) / p(y\x, /3, <j>)p(f3, 0\D)d(3d0. (2) 
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We can assume that p(x\D) = p(x), i.e., future design points are independent of past data. We 
propose to estimate the coefficient vector (3 by solving the following optimization problem: 

where the tuning parameter r >0 and weights Wj >0 are chosen later. (As will become clear shortly, 
4> plays no role in this optimization problem, we can assume at the moment that <fi is known). 
Note that the objective function is the Kullback-Leibler divergence from p(A|/3,0) to the predictive 
distribution p(A\D). We refer to this procedure of estimating (3 through the optimization of ([3]) as 
the predictive lasso (plasso). 

Let {A t = (x t ,yt), t=l,...,T} be Markov chain Monte Carlo (MCMC) samples from the predictive 
distribution p( A | D). The integral in ^ then can be approximated by the average (l/7 1 )^^ =1 log[p(At \D) /p(A t |/3,<; 
and © becomes 

T p 

niin--]Tlogp(A t |/3,0) s . t . £ uy ^ | < r, (4) 

t=i j=i 



or more specifically 



T p 

mm-J2[b(0(xtP))-ytO(x' t [3)] s.t. < r. (5) 

t=i j=i 



This optimization problem is also equivalent to 

, T P 

min - J2 [WMP)) - Vte&tf*)) + A J2 *>i\Pi I (6) 
t=i j=i 

where A is a tuning parameter. Such an optimization problem is easier to deal with if the objective 
function is convex. The convexity of the objective function turns out to depend on the link function, 
and holds for most popular GLMs. 

Often, the integral in x is approximated by a sum over N points x[,...,x^ N . These points might 
not coincide with the observed design points, they "come from the future" (hence the superscript 
"f stands for "future"). For each x{ , let y{ be the mean of MCMC samples {yu,t = l,...,Tb} drawn 
from p(y{\x? ,D) - the predictive distribution of the future response y{ at design point x[ given 
past data D. Then, it is easy to see that © becomes 



N P 

min-5>(*(/3'*{)) -y{0(J3'x{)] + A$>jlftl- (7) 



Note that, under the squared error loss, y( is an estimate of the best prediction (w.r.t. the predictive 
distribution p(y{ \x{ ,D j) for the response at x{. As will be seen in Section [31 for linear regression 
with a convenient specification of priors there is no need to conduct MCMC because the predictions 
y[ = E(y( \xl ,D) have a closed form. 

We have approximated the integral over a; by a sum over N "future" points x{ , i = 1,...,N. 
Typically, these points are specified depending on the context and/or on the distribution p(x) over 
X. As a default implementation of our procedure, however, we propose to identify the future points 
x{ with the observed training points Xi, i = l,...,n. The reason behind this is that if the sample size 
n is large enough and the observed training points x^ were randomly selected from p(x), then by 
the law of large numbers the integral over x can be well approximated by the sum over Xi. In what 
follows therefore, if not otherwise specified, we consider the plasso for GLMs in the following form 

n p 

min-J2m<^)-y{0(x'M + ^Y, w ^\- ( 8 ) 

*=1 J=l 



3 



Note that the original (adaptive) lasso for GLMs is 

n p 

min -Y^mx's) - viOW)] + (9) 

U i=l 3=1 

The plasso in this form differs from the original lasso only in the way it replaces the observed 
responses y\ by the predictions y{ = E(y{\xi,D). Available routines to solve © then can be used 
for ©. 

We have not yet considered the issue of choice of the tuning parameters in the plasso. As the 
primary goal of the plasso is to predict the future, cross-validation is a very natural choice for 
estimating A. As in the adaptive lasso, the weights Wj can be assigned as l/l/Sjl with /3j the MLE 
of f3 j. In a Bayesian context it is also natural to consider /3j as the posterior mode. 



3 Some useful prior specifications 



Given the available routines to solve the optimization problem of form ([5]). all what we need to 
implement the plasso is to calculate the quantities y{ = E(y{\xi,D). To do so, in general, we first 
need to specify a useful prior for parameters, determine posterior distributions and then estimate 
y{ — E(y{ \xi,D) by MCMC or some other method. However, in some cases there is no need to 
conduct MCMC. We first present in this section a prior specification for linear models in which the 
predictions y{ have closed form. For genalized linear models, we present here two prior specifications. 
The first is adapted from I Chen and Ibrahim! (|2003l ) which is int erpretable in terms o f observables 
rather than parameters. The second one proposed recently by iGelman et al. ( 20081) is useful for 
routine applied use. 



3.1 Prior specification for linear models 

Consider the linear model 

y = Xf3 + e 

where y is the n- vector of responses, X is an nx (p+1) design matrix and e is an n- vector of iid 
normal errors with mean zero and variance a 2 . The (p+l)-vector f3 consists of unknown mean 
parameters and we consider the situation where a 2 is also unknown. Consider the conjugate prior 
specification (O'Hagan and Forster, 2004, Chapter 11) p((3,a 2 ) = p(a 2 )p(f3\a 2 ) in which p(a 2 ) is 
inverse gamma 

PK ' T(d/2) V ' Py 2a 2 ' 

and p(j3\a 2 ) is multivariate normal, N(m,a 2 V). With these priors the predictive distribution of 
a new observation A = (x,y) is p(A\D) =p(x\D)p(y\x 1 D) with p{y\x,D) =td+ n (x'j3,s 2 (l + x'Vx)^j 
where 



p = 


(X'X + V- 1 )- 1 (V- 1 m + X'y), 




V = 


(v-i + x'xy 1 , 




s 2 = 


a + m'V~ 1 m + y'y — (V^m + 


X'yYiV- 1 + X'Xy^V^m + X'y) 




n 


+ d-2 


h = 


(X'X)~ 1 X'y. 




w(x) — 


1 + x'Vx. 





Now consider the predictive lasso ((3|) where as usual the integral over x is approximated by 
a sum over N "future" points x{ . Then equivalently, we need to minimize (the scale (j) is now 



4 



re-denoted by a ) 

N r r - p 

J2J l-logp(y{\x{,f3,a 2 )\p(y{\x{,D)dy{ s.t. ^ ™iW < r. (10) 

Noting that 



3=1 



logp(l//|a:{,/3,o*) = -ilo g 27r<r 2 - ^(y/ - (xf)'/3) 2 , 
minimizing (|10[) is equivalent to minimizing 

|loga 2 + ^^^((yf-(xfy/3) 2 |xf,£>) s.t. 5># 3 -| < r. (11) 

a »=1 3=1 

With the closed form of the predictive distribution as a t-distribution we have 

E ([y( - {x{)'0?\x{,d) = s 2 w{x{) + ~ (4YP) 2 ■ 

Substituting this into (jlip we must minimize 

JV 1 N '\ N 2 

y log a 2 + — ^ s 2 ^ ) + — £ ((xf )'0 - (x{ )'/3) (12) 

i=l i=l 

subject to the constraint. Minimizing this as a function of (3 amounts as before to an ordinary lasso 
problem where the responses are replaced with the fitted values from the full model at the future 
design points x{ , i = l,...,N. With a non- informative prior and with the x{ as the observed design 
points Xi this is the ordinary lasso, since in this case (3 = f3 and for the least squares estimator 

n n n 

i—1 i—1 i—1 

where the first term on the right hand side does not depend on (3. 

If (|T2|) has been minimized with respect to /3 subject to the constraint to obtain an estimate 
/3 plasso (this in general depends on the constraint r but we suppress this in the notation) then 
substituting in /3 p i asso and minimizing with respect to a 2 gives 

Eli Var(yf \x{ , D) + £^ - (x{)'P pla£ 



(T 2 

plasso 



N 

2 



TV 



(13) 



The weighted version of plasso. One extension we can consider that gives different results to the 
ordinary lasso in the noninformative case with the x{ the observed design points Xi is the following. 
Suppose that instead of considering predictive distributions in our predictive lasso objective function 
where the variance does not depend on x we predict y{ with 

p(y{\(3,* 2 w(x{)) = N((x{Yf3,a 2 w(x{)) 

That is, we allow our normal form predictive distributions to have variances which vary in proportion 
to the true predictive variances in the full model Va.r(y{\x{ ,D). The standard deviation in the full 



model yVax(y{\x{,D) is often considered a more realistic estimate of the standard error, because 
it incorporates model uncertainty. We now consider minimization of 



N . 

I [-^ogp(y{\(3,a 2 w(x{))\p(y{\x{,D)dy } l 

i—1 
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subject to the constraint and following a similar argument to our previous one we must minimize 



subject to the constraint in order to estimate (3. This is similar to before, but now with weights 
of \/w[x{) for the different design points. We will refer to this producdure as the weighted plasso 
(wplasso). After /3 has been estimated as /3 wp i asso say, the minimization with respect to a 2 gives 



a. 



wplasso 



Ef=i^Var( y /|xf,D) + E^ 1 ^7y(K / )^-K / )^ wpla 



N 



Ef =1 s 2 +Er =1 ^((*f)^fm 



wplasso 



N 



Elicitation of hyperparameters. We now discuss on the choice of the hyperparameters m and V. 
There are ma ny different ways proposed for choosing the matrix V in the literature. For example, 
IZellnerl (|l98fih proposed th e so-called q-prior in which V is set equal to c(X'X) 1 with some c>0 



(c—n is a common choice). iRafterv et al.l ([19970 proposed an alternative where V is a block-diagonal 



matrix. For noncategorical covariates, V is a diagonal matrix diag(s 2 ,K 2 s|f 2 ,...,ft 2 s~ 2 ) where s 2 is the 
sample variance of y, and s\ are the variances of the columns of X. For a categorical covariate, the 
corresponding diago nal element will be a matrix induced from the corresponding dummy variables. 
Rafterv et al . (1997) proposed a value of 2.85 for k together with a — 0.72 and d — 2.58. For the 
parameter m, they proposed the default value of m=(/3Q )LS ,0,...,0)' where ft® 1 " 3 is the OLS estimate 
of /?o- An alternativ e is m = 0. T h ese tw o choices of m often lead to very similar inferences. We 



will use the setup of IRafterv et al. ( 1997 ) in our following numerical examples 



3.2 Prior specifications for generalized linear models 



There is an extensive literature on prior specifications for GLMs. We will briefly present here two 
of them: the f irst on e is due to IChen and Ibrahiml (|2003l ) and the second is proposed recently by 
Gelman et all (j2008h . 



The Chen and Ibrahim prior. Recall that the sampling distribution of observables y=(yi,...,y n ) 
in the GLM case is 



1 



p(y\X, (3, 4>) oc exp ( £ — [ W 0(as</3) - b[0{x'M] ] = exp ( — [y'd - l'b(9)] 



1 



a{4>) 



where 6 = 6(fl) = (0i,...,0„)', i = 9(x' i P), 6(0) = (6(0i),...,6(0„))' and 1 is an n-vector of Is. For 
ease of exposition, we assume that <f> is kn own (and therefore s uppre ssed in the notation), as, for 
example, in logistic and Poisson regression. IChen and Ibrahiml (l2003h proposed the following prior 
for P 

K/3)ocexp( 7 o^K0-l'&(0)]) (14) 

where 70 > and a £ St n are hyperparameters determined later on. Denote this distribution by 
/3|(/>~£)(7o,Q!o)- They proved that the prior (|14[) is proper and that this prior is conjugate with the 
posterior /3|A,y-£i(l+7o,(7oao + y)/(l+7o))- 

As shown by IChen and Ibrahiml (|2003l ). E(y) =cxq, it is natural to choose c*o as a prior guess 
for E(y). Therefore, in practice, ao should be obtained from experts in the field although default 
empirical Bayes alternatives such as choosing ao as the fitted values based on the MLE or other 
methods are also possible. The parameter 70 weighs the importance of the prior guess. In general, 



G 



7o should be taken such that 7o = 7o(^) — ^0 as n— >oo, i.e., the prior has less influence when more 
data is available. An advantage of this prior specification is that it is interpretable in terms of 
observables rather than parameters which are sometimes not easy to elicit. 



The Gelman et al. prior. iGelman et ail f|2008h proposed a weakly informative prior distribution 



for GLMs, constructed by first standardizing the covariates to have mean zero and standard devi- 
ation 0.5, and then putting independent t— distributions on the coefficients. As a default choice, 
they recommended a central Cauchy distribution with scale 1 for the intercept and central Cauchy 
distributions with scale 2.5 for other coefficients. As argued bv lGelman et al. ( 20081 ). this prior spec- 



ification has many advantages; besides, it works in an automatic fashion with no hyperparameter 
elicitation needed. 

Recall that all what we need to implement the plasso is to calculate the quantities y{ = 
E(y{\xi,D). After the prior has been specified, y{ can be estimated by MCMC or some other 
method. It is well-known that 

E(y\X,/3) = b(d) = (b(0 1 ),...,b(6 n ))', 

so that 



y 



f =E(yt\X,y) =E P \ XtV [E(yf\X,P)] = E p \ Xt y [&(0(/9))] (15) 

which can be easily estimated by MCMC samples from the posterior distribution f3\X,y. 

A procedure for fitting GLMs with the Gelman et al. prior has been implemented in R. In 
the following numerical examples for logistic regression where no expert advice is available, we use 
the default prior of Gelman et al. For high-dimensional cases where using MCMC may be time 
consuming, we suggest using the plug- in predictive density to estimate the predictions y( . Our 
experiences show that this is very fast compared to MCMC. 



4 Experiments 

In this section, we study the plasso through simulations and real-data examples. We use the 
convenient prior specifications as in Section [3] The tuning parameter A is selected by 5-fold cross- 
validation. The examples are carried out using R with the help of the R packages glmnet and 
arm. 

A popular measure of predictive ability is the partial predictive score (PPS) (Good, 1952; Geisser, 
1980; Hoeting et al., 1999). Suppose that the data is split into two parts, the training set D T and the 
prediction set D p . The partial predictive score of the distributions induced by model parameters 
((3*,<j)*) is defined as 

PPS = ~Lt^ ^ logp(y\x,(3*,<t>*). 
It is understood that smaller PPS means better predictive performance. 

Example 1: A simulation study for linear regression. Consider the following linear model 

y = 2 + x'/3 + o 6 (16) 

where (3 — (3, 2, 0, 0, 0.5, 0.5, 0, 0)' (so that there are some main and also small effects), e is 
iid iV(0,l), and a > is the noise level. We want to compare the predictive performance of the 
plasso and the wplasso to that of the adaptive lasso (alasso). In our first simulation study, design 
points Xj arc simulated from a multivariate normal distribution As(0,E) with (7^=0.5^^^. We first 
generate from model (fT6|) a dataset which serves as the training set D T . Another dataset D p then is 
generated, which is used to test the predictive performance. Table[T]presents the PPS (after ignoring 
the constants independent of models) and numbers of zero-estimated coefficients averaged over 500 
replications with various factors n T (size of training set), n p (size of prediction set) and a. In each 



7 



TIT = Tip 


(7 


alasso 


plasso 


wplasso 


50 


1 


0.7775(0.1684) 
4.0200(0.9281) 


0.6066(0.1395) 
2.3440(1.2201) 


0.6100(0.1290) 
2.5240(1.1472) 


2 


1.4592(0.1830) 
5.3000(0.9674) 


1.3006(0.1535) 
2.8800(1.3643) 


1.2949(0.1422) 
3.1760(1.3344) 


3 


1.8555(0.1613) 
5.7560(0.7600) 


1.6968(0.1390) 
3.3900(1.4457) 


1.6922(0.1317) 
3.6540(1.4051) 


100 


1 


0.6871(0.1143) 
3.8200(0.6069) 


0.5536(0.0842) 
2.2440(1.1537) 


0.5544(0.0828) 
2.3340(1.1424) 


2 


1.3713(0.1056) 
4.9920(0.9432) 


1.2458(0.0857) 
2.5260(1.3239) 


1.2463(0.0845) 
2.6180(1.2975) 


3 


1.7719(0.1110) 
5.5660(0.7687) 


1.6488(0.0840) 
2.8120(1.4481) 


1.6478(0.0830) 
2.9320(1.4337) 


200 


1 


0.6252(0.0721) 
3.8360(0.4260) 


0.5226(0.0525) 
2.1020(1.1446) 


0.5234(0.0524) 
2.1740(1.1447) 


2 


1.3199(0.0739) 
4.5200(0.8503) 


1.2197(0.0576) 
2.3100(1.1561) 


1.2197(0.0575) 
2.3820(1.1518) 


3 


1.7236(0.0688) 
5.4600(0.7833) 


1.6244(0.0547) 
2.4760(1.2508) 


1.6243(0.0545) 
2.5280(1.2411) 



Tabic 1: PPS and numbers of zero-estimated coefficients averaged over 500 replications with normal 
predictors. The numbers in parentheses are standard deviations. 



case, the first row gives PPS and the second the numbers of zero-estimated coefficients. The numbers 
in parentheses are standard deviations. The results suggest that the plasso and wplasso have better 
predictive ability than the alasso. As one may expect for predictively motivated methods, models 
selected by the plasso and wplasso are less sparse than selected by the alasso. 

In our second simulation study, design points Xj are simulated from a multivariate ^-distribution 
with degrees of freedom being 1.5. By doing so, we intend to simulate situations in which some 
predictors have high leverage points, i.e. their distributions have long tails. The simulation result 
is presented in Table [21 As one may expect, the wplasso works better and more stable than the 
orthers because the variance is modeled to vary in proportion to the true predictive variance. 

In our last simulation study, we try a high-dimensional example. We consider the linear model 
(|16p with p— 100 and most of the coefficients are zero except f3j — 5, j — 10,20,. ..,100. The result 
reported in Table [3] suggests that in general the plasso and wplasso compare favourably with the 
alasso in this example. 

Example 2: Linear regression - predicting percent body fat. Percentage of body fat is one 
important me asure of health, which can be accurately estimated by underwater weighing techniques 
(|Bailevlll994l ). These techniques often require special equipment and are sometimes not convenient, 
thus fitting per cent body fat to simple body measurements is a convenient way to predict body fat. 
IJohnsonl (|l996h introduced a dataset in which percent body fat and 13 simple body measurements 
(such as weight, height and abdomen circumference) are recorded for 252 men. After omitting 
observations 39 (because a weight value of 363.15 pounds is unusually large), 42 (because a height 
value of 29.5 inches is unreasonable), and 182 (because the response value is 0), we obtain a dataset 
of size 249. 

We are concerned with the problem of constructing a model that predicts the response from the 
covariates. Following Hoeting et al. (1999), we use a linear regression model. The primary goal is 
prediction accuracy for future observations; besides this, parsimony is another important objective, 
since a simple model is preferred for the sake of scientific insight into the x — y relationship. 

Using the full dataset, the alasso, plasso and wplasso estimates of f3 are given in Table |4j 
These methods simultaneously do parameter estimation and variable selection, because some of the 
estimated coefficients are exact zero. Recall that the goals at which the methods aim are somewhat 
different: plasso and wplasso have a more explicit predictive motivation; besides, the wplasso in 
some cases is somewhat more realistic in the sense that it allows the variances to vary in proportion 
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riT = np 


a 


alasso 


plasso 


wplasso 


50 


1 


5.1060(35.720) 
2.9860(1.1010) 


2.3525(13.262) 
1.7120(1.2280) 


0.7268(0.2619) 
2.0340(1.1999) 


2 


4.6046(21.094) 
3.5340(1.1898) 


2.2795(4.8080) 
1.9140(1.1702) 


1.3623(0.2467) 
2.2560(1.2319) 


3 


4.4344(22.570) 
4.0260(1.4260) 


3.5535(23.491) 
2.3320(1.3555) 


1.7629(0.2079) 
2.6360(1.3383) 


100 


1 


3.3895(39.616) 
3.0860(0.9963) 


0.9106(2.2240) 
1.6100(1.2136) 


0.6270(0.1969) 
1.8360(1.2329) 


2 


3.5369(20.497) 
3.2380(1.0195) 


1.4987(0.8000) 
1.8480(1.1966) 


1.2759(0.1244) 
2.0860(1.2139) 


3 


2.6614(3.5284) 
3.4680(1.1899) 


3.4847(29.991) 
1.9880(1.2405) 


1.6794(0.1497) 
2.1920(1.2158) 


200 


1 


1.6475(10.968) 
3.1900(0.9569) 


0.7445(1.1904) 
1.3600(1.2703) 


0.5858(0.2033) 
1.6740(1.2533) 


2 


1.9182(4.1911) 
3.0960(1.0024) 


1.5461(3.0734) 
1.6020(1.2010) 


1.2510(0.1717) 
1.8660(1.1979) 


3 


2.3556(4.5199) 
3.3240(1.0262) 


1.8363(1.2353) 
1.8640(1.2134) 


1.6384(0.0662) 
2.1400(1.1896) 



Table 2: PPS and numbers of zero-estimated coefficients averaged over 500 replications with long- 
tailed i-distribution predictors. The numbers in parentheses are standard deviations. 



n T = n P 


a 


alasso 


plasso 


wplasso 


50 


1 


6.6472(51.647) 
76.440(5.0352) 


3.4224(6.9586) 
61.940(7.2292) 


2.5732(0.1725) 
68.320(8.3188) 


2 


8.9814(51.873) 
83.940(9.4159) 


7.0791(20.430) 
73.085(10.190) 


2.9250(0.4229) 
80.805(9.8439) 


100 


1 


1.3247(1.1160) 
77.105(19.757) 


1.0808(1.3710) 
61.465(17.118) 


1.0307(0.0861) 
74.315(10.851) 


2 


2.1953(2.7703) 
74.525(19.389) 


2.2110(2.5382) 
61.160(19.573) 


1.6085(0.1005) 
74.240(11.700) 


200 


1 


0.9484(0.1250) 
89.155(1.0802) 


0.6047(0.0651) 
66.420(9.7709) 


0.6717(0.0504) 
75.235(6.2975) 


2 


1.6169(0.1059) 
89.110(1.1064) 


1.3039(0.0662) 
66.650(9.8158) 


1.3182(0.0505) 
75.300(6.1325) 



Table 3: PPS and numbers of zero-estimated coefficients averaged over 500 replications for the 
large-p case. The numbers in parentheses are standard deviations. 
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tull data 






case 1 






case 11 






case 111 






al 


Pi 


wpl 


al 


Pi 


wpl 


al 


Pi 


wpl 


al 


Pi 


wpl 


c 


-18.00 


6.79 


-0.18 


-14.78 


2.88 


-0.28 


-15.69 


-2.95 


-4.59 


-23.31 


-0.61 


-3.87 


x x 





0.06 


0.04 


0.02 


0.09 


0.08 




















x 2 






































x 3 


-0.20 


-0.29 


-0.27 


-0.26 


-0.40 


-0.39 





-0.17 


-0.14 





-0.24 


-0.22 







-0.30 


-0.11 





-0.24 


-0.17 














-0.34 


-0.25 


x 5 





-0.09 
































x 6 


0.55 


0.78 


0.68 


0.55 


0.70 


0.68 


0.38 


0.66 


0.66 


0.45 


0.69 


0.69 


x 7 





-0.09 








-0.09 























X s 





0.09 








0.16 


0.08 




















x 9 














0.09 























Xio 





0.09 








0.22 


0.17 





-0.39 


-0.43 





-0.04 





X n 





0.13 


0.04 














0.10 


0.10 





0.20 


0.20 


X\2 





0.19 


























0.19 


0.07 


-^13 





-1.62 


-1.31 





-1.34 


-1.20 





-1.16 


-1.15 





-1.44 


-1.35 


PPS 




1.946 


1.933 


1.933 


2.112 


1.913 


1.902 


2.075 


1.965 


1.951 



Table 4: Predicting percent body fat. The abbreviations "al", "pi" and "wpl" stand for alasso, 
plasso and wplasso, respectively. 



to the predictive variance of the full model. 

We now examine the predictive performance of these three procedures. To this end, we split 
the dataset into two parts: the first 125 observations are used as the training set D, the remaining 
observations are used as the prediction set D p . The alasso, plasso and wplasso estimates and their 
PPS are given in Table [4] (case I). As the second examination, the first 125 observations are used 
as the prediction set D p , the remaining observations are used as the training set D, For the third 
examination, we randomly split the full dataset into two (roughly) equal parts which serve as the 
training and prediction sets. The estimates and PPS are summarized in TableSJ As one may expect 
for predictively motivated methods, the variables selected by plasso and wplasso in general contain 
those selected by alasso, i.e., the models selected by plasso and wplasso are bigger than the one 
selected by alasso. In all cases, the plasso and wplasso show a better predictive performance over 
the alasso. It seems that modelling the variances to vary in proportion to the predictive variance of 
the full model is appropriate in this example, because the wplasso has a similar or better predictive 
performance compared with the plasso. 

Example 3: Logistic regression - the Chapman data. This dataset consists of 6 features 
of 200 men as well as binary resp o nses i ndicating the presence of heart disease. This dataset is 
described in detail in Christensenl (1997). We use logistic regression to explore the relationship 



between the presence of heart disease and the features. 

Table [5] presents the result of analysis. The quantities y{ are estimated from MCMC samples 
as in (fT"5|) with 10000 draws after discarding 1000 burn-in samples. To examine the predictive 
performance, we use the first half of observations as the training set, the rest as the prediction set. 
In the second case, we use the last 100 observations as the training set and the first 100 for testing of 
predictive performance; and for the last case 100 observations are randomly selected for the training 
set. In all cases, as shown in Table [SJ plasso shows a better predictive performance over alasso. 

Example 4: Logistic regression - the spambase data. We consider in this example another 
application of the predictive lasso in the logistic regression framework that is more challenging with 
many predictors and instances. We consider the spam email data set created by Mark Hopkins, Erik 
Reeber, George Forman and Jaap Suermondt at the Hewlett-Packard Labs. The data set consists 
of 4061 messages, each has been already classified as email or spam together with 57 attributes 
(predictors) which arc relative frequencies of commonly occurring words. The goal is to design a 
spam filter that could filter out spam before clogging the users' mailboxes. Our goal as usual is to 
construct a parsimonious model with a good prediction accuracy. 

With a large number of predictors and observations, using MCMC may be time consuming so 
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Ml data 


case 


1 


case 


11 


case 


111 




alasso 


plasso 


alasso 


plasso 


alasso 


plasso 


alasso 


plasso 


intercept 


-7.86 


-5.10 


-1.73 


-8.63 


-2.92 


-0.87 


-4.2y 


-3.42 


Xi 


0.05 


0.04 





0.04 


0.02 


0.05 


0.02 


0.04 


x 2 











0.01 














x 3 

















-0.01 


































x 5 





-0.06 





-0.03 





-0.08 





-0.08 


x 6 


0.01 


0.02 





0.02 





0.01 


0.01 


0.02 


PPS 




0.3533 


0.3233 


0.4125 


0.3762 


0.3452 


0.3302 



Table 5: Example 3: the Chapman data. 



that we use the plug in method discussed earlier. We randomly split the data set into two parts 
(training set and prediction set). Both alasso and plasso select a subset of 34 predictors out of 57 
with the PPS are 0.2604 and 0.2538 respectively. In our second try, the alasso selects 28 predictors 
and gives a PPS of 0.2710 while the plasso selects 30 predictors and gives a PPS of 0.2562. Repeat 
this many times, we observe that the plasso often select several predictors more than the alasso and 
that the plasso almost always has a smaller PPS. 

5 Conclusion 

The popular lasso as a procedure for simultaneous variable selection and estimation has many 
attractive properties, and under certain conditions is able to identify the true model if it is assumed 
to exist. Our suggested plasso has a more explicit predictive motivation which aims at selecting a 
useful model for prediction; besides, it enjoys the attractive properties of lasso. A notable feature 
of plasso is that we put no restriction on the predictive distribution p(A\D). Although we have 
considered p(A\D) as arising from a full model including all potential covariates, it can in fact arise 
from any model where a GLM approximation with variable selection is desired. The approximation 
can also be an appropriately local one in the covariate space through a judicious choice of the design 
points in the plasso criterion, which need not correspond to the observed design points. In this paper 
we have motivated and developed the idea of plasso only for GLMs. It is clear that this idea can 
be extended to other models rather than GLMs, and this is a topic for future research. 
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